Carsten Görg, Zhicheng Liu, Neel Parekh, Kanupriyah Singhal, John Stasko
School of Interactive Computing and GVU Center
Georgia Institute of Technology
Student team: [ ] YES [ X ] NO
If you answered yes, name the faculty who agreed to be your sponsor:
We used the Jigsaw system being developed at Georgia Tech as part of the [...]. Jigsaw
is implemented in Java and provides multiple views of the documents in a
collection as well as the entities within those documents. Its specific
focus is to illuminate connections between entities across the documents.
More information about Jigsaw can be found at
http://www.cc.gatech.edu/gvu/ii/jigsaw.
An initial paper about the system appears at VAST 2007.
Data set used:
[ ] RAW DATA SET [ X ] PRE-PROCESSED SET
TOC:
Who – What – Where – Debriefing - Process - Video
Name | Associated organization | Involved in | Involved in terrorist activities? (Yes/No) | Most relevant source files (5 MAX)
Rosalind Baptista | AJL? | Yes | Yes | Meeting image, hunt8 image, ChinchillaDreamin
Catherine “Collie” Carnes | SPOMA | Yes | No | 200301013_4, 20030526-2_57, 20030818_23, ChinchillaDreamin
Faron Gardner | AJL | Yes | Yes | 20030602-1_66, 20030609_4, 20030818_23, ChinchillaDreamin
Cesar Gil (in blog aka chinshopes) | AJL? | Yes | Yes | 20030609_4, 20030901-1_36, 20040705_86, ChinchillaDreamin
Abu Hassan (aka Assan) | Global Ways, Assan Circus | Yes | No | 200301013_4, 20031215-1_91, 20040301-1_75, ImportPermitsv3
Madhi Kim | Global Ways | Yes | No | 20030526-2_57, 20040308_109, 200412-2_13, ImportPermitsv3
Mercurio Navarro | Global Ways | Yes | Yes | Meeting image, Tropical fish spreadsheet
r’Bear | rapper | No | No | 20030609_7, 20040119-1_98, 20040308_109, 20040614_94, 20040412-2_13, 20040628_61
Luella Vedric | socialite | No | No | 200301013_4, 20030526-2_57, 20040119-1_98, 20040412-2_13
# | Date | Event description | Most relevant source files (5 max)
1 | 10/27/2003 | Complains about tropical fish importer ‘[...] | 20030127_57
2 | | People get sick handling tropical catfish imports through [...] | 20040105-1_58
3 | | Vedric and r’Bear attend SPOMA dinner, r’Bear gets cool reception | 20040119-1_98
4 | | Madhi Kim visits r’Bear at Shravaana | 20040308_109
5 | | Chinchillas infected with monkeypox | ChinchillaDreamin
6 | | Vedric and r’Bert (r’Bear?) attend benefit in [...] | 20040412-2_13
7 | | Chinchillas multiply and are distributed | ChinchillaDreamin
8 | | Monkeypox from chinchillas affects people | ChinchillaDreamin
9 | | r’Bear taken to hospital, potentially with monkeypox | 20040628_61
10 | | Seven people reported sick with monkeypox in LA | 20040705_83
11 | | Gil writes on his blog about chinchillas and monkeypox | ChinchillaDreamin
# | Location | Description | Most relevant source files (5 max)
1 | | Chinchillas with monkeypox released | 20030602-1_66, 20040705_83
2 | Shravaana | r’Bear’s animal preserve near [...] | 20040308_109, 20040628_61
3 | | Tropical fish arrive and people get sick when handling the packaging | 20040105-1_58
4 | Global Ways | Importer of exotic animals | 200301027_57, 20040105-1_58, 20040308_109, 20040412-2_13, ImportPermitsv3
Luella Vedric is supposedly a supporter of animal rights causes and an animal
advocate. She is good friends with Collie Carnes, the director of SPOMA, an
animal rights organization. Vedric and the rapper r’Bear, incorrectly identified
as r’Bert in one report, have attended a number of the same benefits for the
cause of animal rights. At one benefit, r’Bear donated $80,000 to SPOMA, but he
received a cool reception from the audience because many animal conservationists
believe that a ranch he is starting has problems.
r’Bear’s ranch or animal preserve is outside [...]
Back in the States, both Vedric and r’Bear attended another benefit about wine
and exotic tropical fish as the guests of Madhi Kim. This is curious because
Vedric and r’Bear are supposedly strong animal supporters, yet they were the
guests of an individual who has a questionable background in that respect.
As a further connection, [...]
In late June, r’Bear showed up with a serious illness that could be monkeypox or
a similar disease. He has bumps on his face, which are consistent with something
like monkeypox. We know that r’Bear’s preserve Shravaana does have chinchillas
on it, and chinchillas have been connected with monkeypox cases in [...]
In an online blog, animal rights activist and chinchilla breeder Cesar Gil has
posted comments and notes about chinchillas and monkeypox. Writings and cartoons
on the blog make us suspicious that Gil may have been involved in the outbreak
through the chinchillas. Gil also notes that he is friends with Collie
(presumably Collie Carnes) and Faron. Faron is likely Faron Gardner, who is with
the Animal Justice League (AJL), which has been linked to attacks and violence
before.
On his blog, Gil writes of Senorita Baptista passing along 6 chinchillas. We
believe this is Rosalind Baptista, and we have photos and intelligence linking
Baptista to chinchilla smuggling. In one photo, a meeting of RB and MN is noted.
RB is likely Rosalind Baptista and MN could be Mercurio Navarro, who is
connected to (manager of) [...]
One potential hypothesis about what occurred is that Vedric did not know about
Kim’s questionable background. She may have mentioned her interactions with
r’Bear and Kim to her friend Collie Carnes. Carnes is connected to Cesar Gil and
the AJL, who then smuggled in chinchillas tainted with monkeypox. Through
zoonosis these animals transmitted the disease to humans in LA, and some were
given to r’Bear as well.
We recommend further investigations into Cesar Gil and his potential connections
to Collie Carnes. We also recommend close scrutiny of [...]
Our system Jigsaw does not have capabilities for
finding themes or concepts in a document collection. Instead, it acts more as a visual index,
helping to show which documents are connected to each other and which are relevant
to a line of investigation being pursued.
Consequently, we began working on the problem by dividing the news
report collection into four pieces (for the four people on our team doing the
investigation). Each of us skimmed the
350+ reports in our own unique subset just to become familiar with general
themes discussed in those documents. We
also jotted down notes about potential people, organizations or events to study
further.
Next, we came together and used Jigsaw to examine the
entire news report collection. Jigsaw
expects an XML file as input that identifies the unique documents and the
entities in those documents. We wrote a
translator that converts the text reports and the pre-identified entities from
the contest data set into the XML form that Jigsaw can read. We then ran Jigsaw and explored a number of
the potential leads that we each identified by our initial skim of the
reports. What we looked for at first
were connections across entities, essentially the same people, organizations or
incidents being discussed in multiple reports.
Jigsaw provides multiple views of the documents and entities so it is
extremely advantageous to have a lot of screen real estate. In Figure 1 below, we show the workstation where
we conducted our investigations. It has
four monitors.
Figure 1: View of the workstation configuration for our
investigations with Jigsaw. Having so
many pixels to work with is a big advantage.
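The translation step mentioned above can be illustrated with a minimal sketch. It is written in Python rather than Jigsaw's Java, and the element and attribute names (`documents`, `document`, `text`, `entity`, `type`) are assumptions chosen for illustration; the actual XML schema Jigsaw expects is not described here. The report text in the usage note is also a made-up placeholder.

```python
import xml.etree.ElementTree as ET


def reports_to_xml(reports):
    """Serialize reports and their pre-identified entities to an XML string.

    Each report is a dict with keys 'id', 'text', and 'entities', where
    'entities' maps an entity type (e.g. 'person') to a list of names.
    The element names used here are hypothetical, not Jigsaw's schema.
    """
    root = ET.Element("documents")
    for report in reports:
        doc = ET.SubElement(root, "document", id=report["id"])
        ET.SubElement(doc, "text").text = report["text"]
        # Sort entity types so the output is stable across runs.
        for entity_type, names in sorted(report["entities"].items()):
            for name in names:
                ET.SubElement(doc, "entity", type=entity_type).text = name
    return ET.tostring(root, encoding="unicode")
```

Calling `reports_to_xml([{"id": "20040628_61", "text": "placeholder text", "entities": {"person": ["r’Bear"]}}])` would yield one `document` element holding the text and a single `entity` child of type `person`.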
Surprisingly, there was relatively little in the way
of connections across entities in the documents. After about 6 or 7 hours of exploration, we
really had no solid leads, just many, many possibilities. So we went back and some of us read sets of
reports that we hadn’t looked at before.
At that point, we began to identify some potential “interesting”
activities. What was clear here was that
the time we spent exploring the documents in Jigsaw was not wasted time. It helped us become more familiar with many
different things going on in the reports.
Thus, new more deliberate examinations and readings of the documents
began to turn up more promising leads.
We began to find connections across some actors and organizations in the
data set.
We were curious, however, why those connections did
not show up in Jigsaw initially. Upon
returning to the system, we learned why.
Some of the key entities in the plot we uncovered (r’Bear, Madhi Kim, [...]
At this point, we decided that we needed to update
the entity information across the document collection. We started with the pre-identified entities
and we wrote some programs that would scan all the text documents and identify
places where these entities simply were missed.
This process resulted in adding more than 6000 new entity-to-document matches
over the whole collection, and the entity connection network became much
denser. The drawback of this technique
was that we also added more noise by multiplying unimportant or wrongly
extracted entities. Therefore, we
manually checked the most frequent entities for validation and made a list of
false positive entities (wrongly classified or extracted) for each entity
type. We excluded these entities from
the document collection and we manually added previously unidentified entities
that we noticed while reading the documents.
We also removed the report date from the list of date entities for a
document. Instead, we stored it as a special
publication date field for the document.
This whole process provided us with a consistent connection network that
was mostly cleaned up for false positives.
Since only one quarter of the entities across the entire collection appeared
in more than one report, we added an option in Jigsaw that allows the user to
filter out all entities that appear in only one report. Doing so allows the user to focus on highly
connected entities at the beginning of the investigation and to add further
entities when more specific questions arise later during the analysis.
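The clean-up steps described above (rescanning every report for known entities, excluding a list of false positives, and optionally hiding entities that appear in only one report) can be sketched as follows. This is a simplified illustration in Python, not the programs we actually wrote; the case-insensitive whole-phrase matching and the function names are assumptions.

```python
import re
from collections import defaultdict


def rescan_entities(documents, known_entities, false_positives=frozenset()):
    """Map each known entity to the set of document ids whose text mentions it.

    documents: dict of document id -> text.
    known_entities: iterable of entity names to look for in every document.
    false_positives: names to exclude entirely (manually curated list).
    """
    matches = defaultdict(set)
    for name in known_entities:
        if name in false_positives:
            continue
        # Whole-phrase match so that "Kim" does not match "Kimberly".
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for doc_id, text in documents.items():
            if pattern.search(text):
                matches[name].add(doc_id)
    return dict(matches)


def drop_single_report_entities(matches):
    """Keep only entities that appear in more than one document."""
    return {name: docs for name, docs in matches.items() if len(docs) > 1}
```

Applying `drop_single_report_entities` to the output of `rescan_entities` leaves only the connected entities, which mirrors the filter option we added to Jigsaw.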
Next we resumed exploring the documents using
Jigsaw. Now, it was much easier for us
to track down different plot threads and explore relationships between actors
and events. Figure 2 shows the main
window of Jigsaw that allows the analyst to query for entities, substrings of
entities, or to search for words/expressions in documents. It also shows the color scheme that is used
in the graph and text views to encode entity types.
Figure 2: Jigsaw main window.
On our second read of the news reports, we noticed
one mentioning the rapper r’Bear being taken to the hospital with bumps on his
face. This seemed suspicious, so we
explored r’Bear in Jigsaw’s graph visualization, shown below in Figure 3. Documents are the larger white circles and
the different types of entities are the smaller colored circles. Expanding the reports that mention r’Bear
brings up many other “interesting” entities, such as Shravaana and Madhi Kim.
Figure 3: Graph view begun by loading r’Bear, then
showing connecting documents and expanding those documents to show included
entities.
Next we would turn to the text view (shown below in
Figure 4) and examine these reports. In
our text view, the entities are highlighted.
We cannot stress enough how important it is to simply read the reports
carefully. What Jigsaw is helpful with
is identifying a small subset of reports on related topics that can be examined
carefully. By looking at the reports
about r’Bear, we noticed the connections to Luella Vedric.
Figure 4: The set of reports relevant to r’Bear with one
in focus showing the document text and identified entities.
Below in Figure 5, we started with a search on Luella
Vedric and then we expanded the documents in which she appears to show the
entities also appearing in those reports.
Double-clicking on an entity such as Vedric makes the connecting
documents appear; double-clicking on those documents then draws out their
contained entities around the document.
Figure 5: Exploration starting with Luella Vedric and
exploring the documents in which she appears.
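The two-step expansion used in the graph view (entity to documents, then documents to their entities) amounts to a walk over a document-entity graph. The following Python sketch illustrates the idea only; it is not Jigsaw's code, and the function name and data layout are assumptions. The sample data is a partial excerpt drawn from the source files listed in the tables above, not the full entity sets of those reports.

```python
def expand_entity(entity, doc_entities):
    """Return the documents mentioning an entity and the other entities in them.

    doc_entities: dict of document id -> set of entity names in that document.
    """
    # Step 1: the documents that connect to the starting entity.
    docs = {doc_id for doc_id, ents in doc_entities.items() if entity in ents}
    # Step 2: every entity contained in those documents, minus the start.
    neighbours = set()
    for doc_id in docs:
        neighbours |= doc_entities[doc_id]
    neighbours.discard(entity)
    return docs, neighbours


# Partial, illustrative excerpt of report contents.
doc_entities = {
    "20040308_109": {"r’Bear", "Madhi Kim", "Shravaana"},
    "20040628_61": {"r’Bear", "Shravaana"},
    "20040119-1_98": {"r’Bear", "Luella Vedric"},
    "20040105-1_58": {"Global Ways"},
}
docs, neighbours = expand_entity("r’Bear", doc_entities)
```

Starting from r’Bear, the expansion reaches three of the four sample reports and surfaces Madhi Kim, Shravaana, and Luella Vedric as neighbouring entities.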
We found Vedric’s connections to Catherine (Collie)
Carnes and examined the text reports about her.
This is where we noted the mention of the Assan Circus (shown below in
Figure 6) which led to further investigations.
By exploring the entity “Assan” we found reports mentioning the Abdul
Hassan alias. Manual exploration of the
importer/exporter spreadsheet file found the connection between Hassan and [...]
Reading the reports about Vedric also made us notice
the mention of musician “r’Bert” that we presume is r’Bear but is simply
incorrectly reported or documented.
Figure 6: Report with Vedric that mentions friend Carnes
and refers to the Assan circus.
Carnes was also mentioned in a report with Faron
Gardner, so we investigated him too. Figure 7 below
shows Jigsaw’s List view.
Here we have selected [...]
Figure 7: Jigsaw’s List view showing connections between
Cesar Gil and Faron Gardner.
At various times in the investigation, we wanted to
get a handle on the chronology of events we were focusing on. Jigsaw’s timeline view, shown below in Figure
8, shows a report as a tower of entities positioned at its correct
point (publication date) on the
timeline. To the right is the focus view
on one particular report. By sweeping
out a region in one timeline (shown here in dark yellow), that portion of the
timeline is reproduced on the next timeline up in more detail. In the figure below this has been done twice.
Figure 8: Jigsaw’s timeline view. This view shows some of the events involving
r’Bear and Madhi Kim.
One technique we used a great deal in our
investigations with Jigsaw was to gather a large set of potentially “interesting”
reports into the graph view and then expand all the reports to show all their
entities. Next, by clicking the “Do
Layout” button in the upper left, all these reports are drawn out along a
circle in the view. Entities connecting
to only one report are drawn outside the circle, and entities connecting to
more than one report are drawn inside.
Thus the set of entities inside the circle shows a kind of
interconnected network of entities that should be examined much more
closely. Clicking on one of these entities
to select it brings the documents in which it appears into one of
Jigsaw’s text views (shown earlier), where they can be read carefully. Figure 9 below shows such a set of
interesting reports for the contest data.
Note the entities on the inside, many of which are involved in the
solution we propose.
Figure 9: Use of the “Do Layout” command in the graph
view. All entities connecting to more
than one document are drawn in the middle making it easier to focus on them.
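One way to obtain a layout with the properties described above is sketched below in Python. It is an assumption about how such a layout could be computed, not the algorithm behind the “Do Layout” button: documents are spaced evenly on a circle, an entity found in a single document is pushed outward along that document's direction, and an entity found in several documents is placed at the centroid of those documents, which always lies inside the circle. The sketch does not spread out entities that end up at the same position.

```python
import math


def circle_layout(doc_entities, radius=1.0, outer_scale=1.3):
    """Compute 2D positions for documents and entities.

    doc_entities: dict of document id -> set of entity names.
    Returns (document positions, entity positions) as dicts of (x, y) tuples.
    """
    doc_ids = sorted(doc_entities)
    doc_pos = {}
    for i, doc_id in enumerate(doc_ids):
        angle = 2 * math.pi * i / len(doc_ids)
        doc_pos[doc_id] = (radius * math.cos(angle), radius * math.sin(angle))

    # Invert the mapping: entity -> documents that contain it.
    entity_docs = {}
    for doc_id in doc_ids:
        for entity in doc_entities[doc_id]:
            entity_docs.setdefault(entity, []).append(doc_id)

    entity_pos = {}
    for entity, docs in entity_docs.items():
        if len(docs) == 1:
            # Single-document entity: outside the circle, next to its document.
            x, y = doc_pos[docs[0]]
            entity_pos[entity] = (x * outer_scale, y * outer_scale)
        else:
            # Shared entity: centroid of its documents, inside the circle.
            xs = [doc_pos[d][0] for d in docs]
            ys = [doc_pos[d][1] for d in docs]
            entity_pos[entity] = (sum(xs) / len(docs), sum(ys) / len(docs))
    return doc_pos, entity_pos
```

With this placement, the distance of an entity from the center separates the two groups: shared entities fall below `radius` and single-document entities above it.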
Below in Figure 10 is a final graph view where we
have filtered out all but the most important entities and documents with
respect to our solution and we have carefully positioned the different reports
and entities to make their connections a little more clear. So this really is more of a documenting or
explanatory view, not one that we would encounter during investigation.
Figure 10: A final cleaned-up view that
could be used as documentation helping to tell the analysis story of this
investigation.
Again, we cannot emphasize strongly enough how
important the process of carefully reading the reports is. Obviously, the problem with the contest data
is that there are about 1500 reports.
Jigsaw is very helpful for exploring different entities in its graphical
views and then having it load a small subset of the relevant documents in one
of its text views. We frequently found ourselves
exploring different entities and we would have 4 or 5 different Jigsaw text
views open, each with only a few documents inside. We could then carefully examine those reports
and it was easy to understand the connections between entities and how the
pieces began to fit together.
Working in this way also underlined the
importance of our exploration environment: the four displays on which we ran the
system. We simply needed many pixels to
spread out all the different document views.
Performing this exploration on one display would be extremely slow and
burdensome because it would require so much window flipping.
Our analysis activities exposed a number of shortcomings in the Jigsaw system, and thus they
served very much as a formative evaluation. We made a number of changes to each view in
our system as we were working on the contest.
Probably the key missing feature in the system at this time is the
ability to identify or remove entities while running the system and doing
active investigations. We plan to add
that capability soon.